Split Dataset for Model Development Quizzes

Training set composition

SOLUTION:

33.33% no tumor, 33.33% benign tumor, 33.33% malignant tumor

Splitting training and validation data

I have 1,000 images that I can use to train and validate a CNN to classify the presence or absence of calcifications in a mammogram. In the real world, calcifications are present only about 30% of the time, and my dataset of 1,000 images reflects this prevalence. To make the most of my data and throw away as little as possible, how should I split it using the 80-20 rule?

270 images with calcifications in my training set, 30 with calcifications in my validation set

60 images with calcifications in my training set, 240 with calcifications in my validation set

150 images with calcifications in my training set, 150 with calcifications in my validation set

240 images with calcifications in my training set, 60 with calcifications in my validation set

SOLUTION:

240 images with calcifications in my training set, 60 with calcifications in my validation set
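
The arithmetic behind this answer can be checked with a short sketch. 30% of 1,000 images is 300 images with calcifications, and applying the 80-20 split within each class puts 240 of them in training and 60 in validation, so no images are discarded. The function name `stratified_split_counts` and its parameters are hypothetical, introduced only for illustration, and the sketch assumes the split is applied separately to each class.

```python
def stratified_split_counts(n_total, positive_rate, train_fraction):
    """Return per-class image counts for a split that keeps the class ratio.

    Hypothetical helper for illustration: it only does the arithmetic,
    it does not shuffle or move any images.
    """
    n_pos = round(n_total * positive_rate)      # images with calcifications
    n_neg = n_total - n_pos                     # images without calcifications
    train_pos = round(n_pos * train_fraction)   # positives assigned to training
    train_neg = round(n_neg * train_fraction)   # negatives assigned to training
    return {
        "train": {"positive": train_pos, "negative": train_neg},
        "validation": {
            "positive": n_pos - train_pos,
            "negative": n_neg - train_neg,
        },
    }


# Numbers taken from the quiz: 1,000 images, 30% prevalence, 80-20 split.
counts = stratified_split_counts(1000, 0.30, 0.80)
print(counts["train"]["positive"], counts["validation"]["positive"])  # 240 60
```

Under these assumptions the training set holds 800 images and the validation set holds 200, each with the same 30% share of calcification images as the full dataset.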